Introduction

TED (Technology, Entertainment, Design) is an organization that posts talks online for free dissemination. Founded in 1984 as a conference, it has grown considerably since then and has become a symbol of idea sharing that reaches into everyday life. Even people who have rarely watched a TED talk have probably noticed its famous slogan, “ideas worth spreading”, and the starry sky that opens each talk. Watching a TED talk after work or study can be inspiring, and every fan has his or her own criteria for rating talks across a wide range of topics. As a result, these talks attract abundant reviews and comments from viewers, which can serve as indicators of both the popularity of the talks and the interests of those viewers.

The members of our group have always been fascinated by TED talks and the marvelous diversity of their content (see the topic cloud below). We tend to focus on the topics that interest us and to share our thoughts about them. This gave us the idea of exploring what exactly affects the popularity of a particular TED talk, combining the skills from the data science course with some strategies for manipulating the data we obtained.

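## Load the packages used throughout the report.
library(tidyverse)
library(wordcloud)
library(RColorBrewer)
library(plotly)
library(tidytext)
library(viridis)

## Create a word cloud for TED topics.
## Tidy the "tags" variable: strip the outer "['" and "']", then split on "', '".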
ted_topic = ted %>%
  select(no, views, year, month, day, tags) %>%
  mutate(tags = substr(tags,3,nchar(tags)-2),
         tags = str_split(tags, "', '")) %>% 
  unnest(tags) %>% 
  rename(topic = tags)

## Frequency of each topic
n_topic = ted_topic %>% 
  group_by(topic) %>%
  count() %>%
  arrange(desc(n)) %>%
  ungroup

## Create the word cloud
set.seed(1234)
wordcloud(words = n_topic$topic, 
          freq = n_topic$n, 
          min.freq = 1,
          max.words = 200, 
          random.order = FALSE, 
          rot.per = 0.35, 
          colors = brewer.pal(8, "Dark2"))

This study uses a dataset covering 2550 TED talks hosted on the official TED.com website from June 26th, 2006 to September 21st, 2017. The dataset contains 17 variables, including the number of views, the number of comments, and a description of each talk. Key variables are summarized as follows,

Based on this dataset, the study explores two main points. First, the report describes and processes several notable variables among the 17. For instance, the topic (or theme) of each talk can be extracted from the “tags” variable, which lays a foundation for investigating the association between the topic and the popularity of a talk. The development of TED over time can also be presented in this way. Second, the report examines how the popularity of a TED talk is connected with other variables. What makes a TED talk popular is the main focus of this study.

Methods

In this study, we follow a step-by-step analysis to reveal the association between popularity and other variables. The popularity of a TED talk is represented by its number of views (the “views” variable). Most of the analyses involve string operations on the dataset.

  1. The topic (or theme) of a TED talk is the first variable whose association with popularity is evaluated, since viewers are likely to prefer talks on topics they like. The topics of each talk are extracted from its “tags” variable and then summarized. For talks tagged with multiple topics, all topics are kept, since a talk can realistically belong to several themes. Finally, the topics are ranked by frequency, tables and figures show the top 10 topics with the most talks, and the change in favorite topics over time is plotted.

  2. To explore the connection between the popularity of a TED talk and its speaker, we find the ten most common speaker occupations and compare their views, using strategies similar to the topic analysis. In this part, we noticed that some occupation terms have almost the same meaning, e.g. “author” and “writer”, both of which appear very frequently in the dataset. We therefore unified these two into “writer”. A speaker may also have more than one occupation, in which case we treat the first one listed as the main occupation that best represents his or her focus. The “speaker_occupation” variable is therefore cleaned so that only the first occupation is kept. The association between the popularity of TED talks (number of views) and the occupation of speakers is presented using a boxplot.

  3. The “ratings” variable contains abundant information about viewers' reviews of each TED talk. Each talk may receive different types of reviews with varying counts, so it is interesting to conduct a sentiment analysis on the ratings. The “ratings” variable is first split on certain symbols to extract each type of review for each talk. The sentiment of the talk is then calculated from the polarity of each review word, obtained by joining with the “bing” lexicon, weighted by the counts. In addition, given the specific content of the “ratings” variable, some positive words, such as “fascinating”, “convincing” or “interesting”, may also reflect the popularity of a talk. We therefore compare the 10 talks with the most views (the most popular) with the 10 talks with the largest positive sentiment to check how well they match and to assess whether sentiment really reflects the popularity of TED talks.

  4. After the foregoing analyses, we take into consideration other covariates that have not yet been involved. We explore the linear relationship between the number of views of each talk and other predictors, including duration, number of languages translated, year, number of speakers and sentiment rating, to show factors that may affect popularity. Here, “year” is categorized into three levels, “before 2010” (2006-2009), “between 2010 and 2015” (2010-2014) and “after 2015” (2015-2017), with the first level serving as the reference.
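
The tag extraction in step 1 can be illustrated on a single value. This is a minimal base-R sketch: the string `raw_tags` is a hypothetical example, and its bracket-and-quote layout is assumed from the `substr()` and `str_split()` calls used in the report rather than taken from the data.

```r
# Hypothetical example value; the layout "['a', 'b', 'c']" is an assumption
# inferred from how the report's code trims and splits the "tags" column.
raw_tags <- "['children', 'creativity', 'culture']"

# Drop the leading "['" and the trailing "']".
trimmed <- substr(raw_tags, 3, nchar(raw_tags) - 2)

# Split on the "', '" separator that sits between two tags.
topics <- strsplit(trimmed, "', '", fixed = TRUE)[[1]]

print(topics)
```

Each element of `topics` then becomes one row per talk and topic, which is what `unnest()` does in the tidyverse version.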

Results and Analysis

  1. The association between Topic and Popularity

In this section, we examine how the topic of a TED talk relates to its popularity.

## filter top 10 topics with the most talks 
top10_topic = head(n_topic, 10)
## Visualize the result
gg_top10 = top10_topic %>%
  mutate(topic = fct_reorder(topic, n)) %>%
  ggplot(aes(x = topic, y = n, fill = topic)) +
    geom_bar(stat = "identity")

ggplotly(gg_top10) %>% 
  layout(xaxis=list(title = "TED Topic"),
         yaxis=list(title = "Number of talks"),
         title="Top10 most talked topics")

TED includes talks on 419 different topics. The figure above shows the 10 topics with the most talks. Technology is the most common topic, with 727 talks.

## Visualize the distribution of views for the top 10 topics.
## Note: ylim() drops talks above 5e6 views before the violins and medians are drawn.
gg_talks_topics = ted_topic %>%
  filter(topic %in% top10_topic$topic) %>%
  mutate(topic = fct_reorder(topic, views)) %>%
  ggplot(aes(x = topic, y = views, fill = topic)) +
    geom_violin() +
    ylim(0, 5e+6) +
    ## `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0).
    stat_summary(fun = median, geom = "point", size = 2)

ggplotly(gg_talks_topics) %>% 
  layout(xaxis=list(title = "Topic"),
         yaxis=list(title = "Number of views for each video"),
         title="Views for the top10 topics")

This figure shows the distribution of views for the top 10 topics. All of the distributions are heavily right-skewed, which indicates that a few talks are extremely popular. The topics are ordered by their median number of views. Among the 10 most common topics, culture and business have the highest median number of views. Although technology is the most frequent topic, audiences show more interest in talks related to culture or business.

## Visualize the number of TED topics through the years.
gg_talks_years = ted_topic %>%
  filter(topic %in% top10_topic$topic) %>%
  group_by(year, topic)%>%
  count() %>%
  ggplot(aes(x = year, y = n, group = topic, color = topic)) +
    geom_line() 

ggplotly(gg_talks_years) %>% 
  layout(xaxis=list(title = "Year"), 
         yaxis=list(title = "Number of talks"),
         title = "Talks of top10 topics across years")
## Visualize TED topics views through the years.
gg_frac_year = ted_topic %>%
  filter(topic %in% top10_topic$topic) %>%
  ggplot(aes(x = year, y = views, fill = topic)) +
    geom_bar(stat = "identity", position = "fill")

ggplotly(gg_frac_year) %>% 
  layout(xaxis=list(title="Year"),
         yaxis=list(title="Fraction"),
         title="Fraction of topics across years")

The first figure above shows how many talks covered each of the top 10 topics in each year. The topic “TEDx” appears especially often in 2012. TEDx is a program that supports independent organizers who want to create a TED-like event in their own community. In 2012, the week-long TEDx Summit was held by the Doha Film Institute; this inaugural event gathered TEDx organizers from around the world for workshops, talks and cultural activities, which may explain the increase in TEDx talks in 2012. In 2016, we also see a peak across all topics. Several global events were held in 2016, including “TED 2016 Dream”, a conference about ideas that took place February 15-19, 2016, in Vancouver, BC, Canada.

The second figure above shows how the share of views among the top 10 topics changes over the years. Culture-related talks had the largest share of views in 2006, when the first six TED Talks were posted online. However, the share of talks on culture has dipped, decreasing steadily since 2013. In contrast, the topics innovation and health have drawn more and more attention over the years.

  2. The association between Speaker and Popularity
speaker = ted %>% 
  separate(speaker_occupation, into = c("speaker_occupation", "remove"), sep = "/") %>% 
  separate(speaker_occupation, into = c("speaker_occupation", "remove1"), sep = ",") %>% 
  separate(speaker_occupation, into = c("speaker_occupation", "remove2"), sep = ";") %>% 
  select(-remove, -remove1, -remove2) %>% 
  mutate(speaker_occupation = str_to_lower(speaker_occupation)) %>% 
  mutate(speaker_occupation = str_replace(speaker_occupation, "author", "writer"))
speaker %>% 
  group_by(main_speaker) %>% 
  summarize(n = n()) %>% 
  arrange(desc(n)) %>% 
  head(10) %>% 
  ungroup()%>% 
  mutate(main_speaker = fct_reorder(main_speaker, n)) %>%
  ggplot(aes(x = main_speaker, y = n)) +
  geom_col(aes(fill = main_speaker)) +
  coord_flip() +
  labs(
    title = "visualization of the ten top speakers",
    x = "main speaker",
    y = "number of talks")

From the figure above, we find that statistician Hans Rosling gave the most TED Talks of all TED speakers, 9 in total. As a professor of global health at Sweden’s Karolinska Institute, his work focuses on dispelling common myths about the so-called developing world, which (he points out) is no longer worlds away from the West. He is followed by biologist Juan Enriquez, who gave 7 talks. The top ten speakers each gave between 5 and 9 talks, which probably indicates that these speakers are very popular among viewers and are invited to speak frequently.

speaker %>% 
  group_by(speaker_occupation) %>% 
  summarize(number_talks = n()) %>% 
  arrange(desc(number_talks)) %>% 
  head(10) %>% 
  pander::pander()

speaker_occupation   number_talks
writer               107
artist               46
designer             45
journalist           40
entrepreneur         36
inventor             36
architect            33
psychologist         31
neuroscientist       30
physicist            26

Based on the table above, writer is the most common speaker occupation, with 107 talks given by writers. It is followed by artist, designer, journalist, entrepreneur, inventor, architect, psychologist, neuroscientist and physicist. We were surprised to find that the top four occupations are all related to the arts, which suggests that people who work in the arts may be more willing to give a TED Talk and that their talks might attract more viewers.

gg_views_occ = speaker %>% 
  filter(speaker_occupation %in% c("writer", "artist", "designer", "journalist", "entrepreneur", "inventor", "architect", "psychologist", "neuroscientist", "physicist")) %>% 
  mutate(speaker_occupation = fct_reorder(speaker_occupation, views)) %>% 
  ggplot(aes(x = speaker_occupation, y = views)) +
  geom_boxplot(aes(fill = speaker_occupation), alpha = .9) +
  ylim(0,7.5e+06)

ggplotly(gg_views_occ) %>% 
  layout(title="Views for talks from top10 speaker occupations",
         yaxis=list(title="Views"),
         xaxis=list(title="Speaker occupation"))

Several extreme points in this plot compress the boxes and make the distribution for each speaker occupation hard to see, so we limit the range of the y axis to make the plot clearer. In the resulting boxplot, physicists have the lowest median and psychologists the highest, indicating that TED Talks given by psychologists tend to be more popular and attract more viewers than talks given by speakers with the other occupations. In addition, every occupation has some extreme points. This is easy to understand: these talks may have been given by the most famous or authoritative people in the field, or their content may have been relevant to the hottest issues of the time.

  3. Sentiment analysis
ratings = ted %>%
  select(no, ratings) %>% 
  mutate(ratings = substring(ratings,3, nchar(ratings)-2)) %>% 
  mutate(ratings = str_split(ratings,"\\}, \\{")) %>% 
  unnest(ratings) %>% 
  mutate(rat.words = sub(".*name': '(.*?)',.*", "\\1", ratings),
         rat.words = tolower(rat.words),
         rat.cnt = as.numeric(sub(".*'count': ", "", ratings))) %>%
  select(-ratings)
### read the sentiment dataset from 'bing'
bing_sent = get_sentiments("bing")

#### calculate sentiment value for each observation
rat_sent = ratings %>% 
  rename(word = rat.words) %>% 
  inner_join(bing_sent, by="word") %>% 
  group_by(no,sentiment) %>% 
  summarize(sum_cnt = sum(rat.cnt)) %>% 
  ungroup() %>% 
  spread(sentiment, sum_cnt) %>% 
  mutate(sentiment = (-1) * negative + positive) %>% 
  select(-negative, -positive) %>% 
  left_join(ted, by = "no")

### plot the sentiment score of each talk, colored by views
rat_sent %>% 
  mutate(no = factor(no),
         no = fct_reorder(no, sentiment)) %>% 
  ggplot(aes(no, sentiment, fill=views, color=views)) +
    geom_bar(stat = "identity") +
    theme(axis.title.x = element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank()) + 
    scale_fill_viridis() +
    scale_color_viridis()

### try the cube root to reduce skewness
gg_trans_sent = rat_sent %>% 
  mutate(no = factor(no),
         no = fct_reorder(no, sentiment),
         cubert = ifelse(sentiment > 0, sentiment^(1/3), -(-sentiment)^(1/3))) %>% 
  ggplot(aes(no, cubert, fill=views, color=views)) +
    geom_bar(stat = "identity") +
    theme(axis.title.x = element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank()) + 
    scale_fill_viridis() +
    scale_color_viridis()

ggplotly(gg_trans_sent) %>% 
  layout(yaxis=list(title="Cube root of sentiment"))

In the sentiment analysis, we first extract the rating words and the corresponding counts that are nested in the ratings variable of each talk and assign them to two new variables, rat.words and rat.cnt. We then match the rating words (rat.words) against the ‘bing’ sentiment lexicon, which labels each matched word as ‘positive’ or ‘negative’. For each talk number (no), we subtract the sum of the negative counts from the sum of the positive counts and use the difference as the sentiment score of that talk.
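
The scoring steps above can be traced on one talk in base R. This is only a sketch under stated assumptions: the string `raw` is a hypothetical ratings value in the layout that the parsing code expects, and `lexicon` is a hypothetical three-word stand-in for the result of `get_sentiments("bing")`.

```r
# Hypothetical ratings string for one talk (layout assumed from the report's
# parsing code, values invented for illustration).
raw <- paste0("[{'id': 7, 'name': 'Inspiring', 'count': 120}, ",
              "{'id': 1, 'name': 'Confusing', 'count': 15}, ",
              "{'id': 9, 'name': 'Fascinating', 'count': 80}]")

# Strip the outer "[{" and "}]", then split into one piece per rating type.
pieces <- strsplit(substring(raw, 3, nchar(raw) - 2), "}, {", fixed = TRUE)[[1]]

# Pull out the rating word and its count, as the report's code does.
words  <- tolower(sub(".*name': '(.*?)',.*", "\\1", pieces))
counts <- as.numeric(sub(".*'count': ", "", pieces))

# Hypothetical mini lexicon standing in for get_sentiments("bing").
lexicon <- c(inspiring = "positive", fascinating = "positive",
             confusing = "negative")
polarity <- lexicon[words]

# Sentiment score = sum of positive counts minus sum of negative counts.
score <- sum(counts[polarity == "positive"]) - sum(counts[polarity == "negative"])
print(score)  # (120 + 80) - 15 = 185
```

Rating words that are missing from the lexicon are dropped by the `inner_join()` in the report, so they do not contribute to the score.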

After obtaining the sentiment score, we plot the sentiment of each talk together with its number of views. We first coerce the no variable to a factor, reorder it according to the sentiment score, and plot the relationship between talk number and sentiment score. However, a few talks have extremely large scores, which makes the graph heavily skewed, so we take the cube root of the sentiment score and plot again to obtain the second graph. Most TED talks have positive sentiment ratings, since only a small portion of the plot lies on the negative side of the y axis. Further, talks with a large number of views also tend to receive high ratings: yellow and green, which indicate more views, mostly appear on the right side of the plot, where the sentiment scores are high.
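
Because sentiment scores can be negative, the cube root has to keep the sign of the score. A compact way to write this is sketched below; `signed_cube_root` is a hypothetical helper name and the input values are invented, but the transformation is equivalent to the `ifelse()` expression used in the report.

```r
# Sign-preserving cube root: take the root of the magnitude, then restore
# the sign. In R, a negative base raised to 1/3 would otherwise give NaN.
signed_cube_root <- function(x) sign(x) * abs(x)^(1/3)

# Hypothetical sentiment scores, including a negative one and zero.
scores <- c(-27, 0, 8, 1000)
transformed <- signed_cube_root(scores)
print(transformed)
```

The transformation is monotone, so the ordering of the talks along the x axis is unchanged while the largest scores are pulled in.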

top10_sent = rat_sent %>% 
  arrange(desc(sentiment)) %>% 
  head(10) 

top10_view = ted %>%
  arrange(desc(views)) %>% 
  head(10) 

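## Avoid the name `match`, which masks base::match(), and keep the table
## itself in the object rather than the printed output.
top10_match = tibble(top10_largest_sentiment = top10_sent$no,
                     top10_most_views = top10_view$no)

pander::pander(top10_match)

top10_largest_sentiment   top10_most_views
202                       1
1347                      1347
838                       678
678                       838
1                         453
1031                      1777
177                       202
554                       6
531                       2115
1164                      1417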

In this table, the 10 TED talks with the largest positive sentiment and the 10 with the most views are listed in two separate columns. Although their ranks differ, five talks appear in both columns: talks 1, 1347, 838, 678 and 202. To some extent, this supports the earlier speculation that sentiment reflects popularity. More specifically, talks with large sentiment values, meaning that people rate them positively, tend to be popular and have a large number of views.
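
The overlap can be checked directly with `intersect()`. The sketch below copies the talk numbers from the two columns of the table; the vector names are chosen here for illustration.

```r
# Talk numbers copied from the two columns of the table above.
top10_largest_sentiment <- c(202, 1347, 838, 678, 1, 1031, 177, 554, 531, 1164)
top10_most_views        <- c(1, 1347, 678, 838, 453, 1777, 202, 6, 2115, 1417)

# Talks that appear in both top-10 lists.
overlap <- intersect(top10_largest_sentiment, top10_most_views)

print(overlap)
print(length(overlap))
```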

  4. Linear model building
ted.lm = select(rat_sent, no, duration, languages, num_speaker, year, views, sentiment) %>%
  mutate(year = as.numeric(year),
         year = ifelse(year < 2010, "before 2010", 
                       ifelse(year >= 2010 & year < 2015, "between 2010 and 2015", "after 2015")),
         year = factor(year, levels = c("before 2010","between 2010 and 2015","after 2015")))

model1 = lm(views ~ sentiment + year + duration + languages + num_speaker, data = ted.lm)

summary(model1) %>% 
  broom::tidy() %>% 
  pander::pander()

term                        estimate   std.error   statistic   p.value
(Intercept)                 -1153059   228950      -5.036      5.079e-07
sentiment                   625.1      12.34       50.66       0
yearbetween 2010 and 2015   56150      80416       0.6982      0.4851
yearafter 2015              475678     93312       5.098       3.691e-07
duration                    369.3      91.75       4.025       5.867e-05
languages                   58346      3791        15.39       3.664e-51
num_speaker                 20346      152909      0.1331      0.8942

In this model, the level “before 2010” of the categorical variable “year” is the reference. Among the predictors, the estimated coefficient for the number of speakers (“num_speaker”) is not significant at the 0.05 level (p = 0.89). The “between 2010 and 2015” level of year is not significant either (p = 0.49), but year is kept because its “after 2015” level is significant. We therefore find no evidence of a linear association between the outcome (number of views) and the number of speakers, adjusted for the other covariates in the model, so we drop the num_speaker variable and refit the model.

update(model1, .~. -num_speaker) %>%
  summary() %>% 
  broom::tidy() %>% 
  pander::pander()

term                        estimate   std.error   statistic   p.value
(Intercept)                 -1131852   164325      -6.888      7.111e-12
sentiment                   625.1      12.34       50.67       0
yearbetween 2010 and 2015   56237      80398       0.6995      0.4843
yearafter 2015              476150     93227       5.107       3.507e-07
duration                    369.4      91.73       4.028       5.799e-05
languages                   58326      3788        15.4        3.17e-51

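The final model is
\[
\begin{aligned}
\text{Views} = {} & -1.13 \times 10^{6} + 625.1 \times \text{Sentiment} + 5.62 \times 10^{4} \times I\{\text{Year between 2010 and 2015}\} \\
& + 4.76 \times 10^{5} \times I\{\text{Year after 2015}\} + 369.4 \times \text{Duration} + 5.83 \times 10^{4} \times \text{Languages}
\end{aligned}
\]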
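
To see how the fitted equation is used, the sketch below plugs a talk into the coefficients reported in the refitted model table. The function `predict_views` and the example talk are hypothetical and are introduced only for illustration; duration is assumed to be in the same units as the dataset's `duration` column.

```r
# Coefficients copied from the refitted model table in this report.
coefs <- c(intercept = -1131852, sentiment = 625.1,
           year_2010_2015 = 56237, year_after_2015 = 476150,
           duration = 369.4, languages = 58326)

# Hypothetical helper: fitted mean views for given covariate values.
# The year indicators follow the report's cut points (reference: before 2010).
predict_views <- function(sentiment, year, duration, languages) {
  unname(coefs["intercept"] +
           coefs["sentiment"] * sentiment +
           coefs["year_2010_2015"] * (year >= 2010 & year < 2015) +
           coefs["year_after_2015"] * (year >= 2015) +
           coefs["duration"] * duration +
           coefs["languages"] * languages)
}

# Hypothetical talk: sentiment score 2000, published 2016, duration 900,
# available in 30 languages.
base_talk <- predict_views(2000, 2016, 900, 30)
one_more  <- predict_views(2000, 2016, 900, 31)

print(base_talk)
print(one_more - base_talk)  # effect of one additional language
```

The difference between the two predictions equals the `languages` coefficient, which is the "per additional language" interpretation of a linear model.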

Compared with the reference level, the “after 2015” level has a significant positive estimate, indicating that TED talks published in 2015 or later have a higher mean number of views than those published before 2010, adjusted for the other covariates. In other words, TED talks published after 2015 are more popular than those published before 2010. Languages and duration are also strongly associated with the number of views. Specifically, the mean number of views increases by about 58,326 for each additional language in which the talk is available, adjusted for the other covariates. As expected, sentiment plays an important role in the model, which suggests that it does reflect the popularity of TED talks to some extent. The adjusted R-squared is around 60%, meaning that about 60% of the variation in the number of views is explained by these covariates, which suggests that the linear model fits the data reasonably well.

Conclusion

In this study, the question of why a TED talk becomes popular is addressed by analyzing the association between popularity, represented by the number of views, and the other variables in the dataset. According to the results, the topic, the speaker, the duration of a TED talk and the number of languages available are all connected with the popularity of that talk. Several topics, such as technology, business and culture, tend to be more popular. Speakers and their occupations also play important roles in drawing attention, with psychologists, writers, scientists and entrepreneurs among the most popular speaker occupations. The number of languages in which a TED talk is available is also important to its popularity, so making a talk available in more languages may help improve its popularity. Finally, the popularity of TED talks is compared over time, which shows that more and more people are joining this sharing of thoughts and that the “ideas worth spreading” are truly propagating.